Squeeze performance out of the (post)processing stage of PDF build. by KubaO · Pull Request #154 · twinbasic/documentation

KubaO · 2026-05-24T12:04:18Z

The (post)processing stage reads the PDF that Chrome wrote out, modifies it (mostly by writing a fresh outline, i.e. bookmarks), and writes it back out. Our own outline processing takes about 10 ms; most of the time (~1 s) goes to pdf-lib loading and then saving the PDF.

The heap use and CPU time have been squeezed down as much as possible without changing pdf-lib's API.

PDFObjectParser.parseDict ends every dict it parses with four PDFName.of calls for Type/Catalog/Pages/Page, even on the dicts that have no /Type entry at all. With fast-decode-name in effect each call collapses to a fastCache.get, but fastOf was still the #4 row in process.cpuprofile at ~5%. Pool-dedup makes the canonical PDFNames reference-stable for the whole load, so capture them once at shim-load and substitute module-level constants for the four calls. Drops ~17 ms (~22%) of fastOf self-time. Output byte-equivalent.

… scans.

The sampling heap profile of the process phase showed `new Map()` + Map.prototype.set at ~80 MB combined (50 % of total allocations on the book), 80 % of that traffic from the parser's per-dict accumulator. PDF dicts are tiny (typically <= 10 entries), so the hash-table arena per dict was pure overhead. The new fast-dict-array shim patches PDFDict's storage to a flat alternating array, plus all prototype methods (set/get/has/delete/keys/values/entries/asMap/clone/toString/sizeInBytes/copyBytesInto) and the parser's parseDict hot loop. Subsumes fast-dict-iter + fast-parse-dict, both of which stay in the tree as A/B baselines.

Wall-clock: process phase 1.18s -> 1.13s (4 runs paired no-profile). Heap: Map+set builtin traffic 79 MB -> 15 MB.
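As an illustration of the flat alternating layout described above, here is a minimal sketch. It is not the fast-dict-array shim itself: `FlatDict` and its members are hypothetical names, and it assumes keys are interned objects so identity comparison is enough.

```javascript
// Hypothetical sketch: a small dict stored as [k0, v0, k1, v1, ...].
// Keys are compared by identity, which assumes interned key objects.
class FlatDict {
  constructor() {
    this.kv = [];
  }
  indexOfKey(key) {
    const kv = this.kv;
    for (let i = 0; i < kv.length; i += 2) {
      if (kv[i] === key) return i;
    }
    return -1;
  }
  get(key) {
    const i = this.indexOfKey(key);
    return i < 0 ? undefined : this.kv[i + 1];
  }
  has(key) {
    return this.indexOfKey(key) >= 0;
  }
  set(key, value) {
    const i = this.indexOfKey(key);
    if (i < 0) this.kv.push(key, value); // new key: append one pair
    else this.kv[i + 1] = value;         // existing key: replace in place
  }
  delete(key) {
    const i = this.indexOfKey(key);
    if (i < 0) return false;
    this.kv.splice(i, 2);
    return true;
  }
  *entries() {
    for (let i = 0; i < this.kv.length; i += 2) yield [this.kv[i], this.kv[i + 1]];
  }
}

// Stand-ins for interned PDFName instances.
const Type = { name: 'Type' };
const Count = { name: 'Count' };

const dict = new FlatDict();
dict.set(Type, 'Pages');
dict.set(Count, 3);
dict.set(Count, 4); // replaces the value, the array does not grow
```

A linear scan over a handful of pairs is the trade being made here: lookups are O(n), but n is tiny for typical PDF dicts and no hash table is allocated per dict.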

After fast-dict-array shipped, PDFContext.assign's `indirectObjects.set(ref, object)` was the only hot Map.set left in the heap profile -- one set per parsed indirect object, ~14 MB of hash-table growth on the book. fast-indirect-objects patches assign / lookup / lookupMaybe / delete / getObjectRef / enumerateIndirectObjects to consult a dense array keyed by objectNumber for gen=0 PDFRefs (the overwhelming common case), Map fallback for gen!=0. Lazy-init on first assign; no constructor patch needed. Mirror of the fast-refs trick on the value side. CPU: PDFContext.assign drops out of the process top-15. Heap: set traffic 14.8 MB -> 7.7 MB (-48 %). Remaining 7 MB is the upstream PDFRef pool.set on cache miss -- next target.
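The dense-array-plus-fallback idea can be sketched in isolation. This is a toy table, not the fast-indirect-objects patch: `IndirectObjectTable` and its string-keyed fallback are assumptions made for the example.

```javascript
// Hypothetical sketch: generation-0 objects live in a dense array indexed by
// objectNumber; anything with a non-zero generation falls back to a Map.
class IndirectObjectTable {
  constructor() {
    this.dense = [];         // index = objectNumber (generation 0 only)
    this.sparse = new Map(); // key = "obj gen" (generation != 0)
  }
  assign(objectNumber, generationNumber, object) {
    if (generationNumber === 0) this.dense[objectNumber] = object;
    else this.sparse.set(`${objectNumber} ${generationNumber}`, object);
  }
  lookup(objectNumber, generationNumber) {
    if (generationNumber === 0) return this.dense[objectNumber];
    return this.sparse.get(`${objectNumber} ${generationNumber}`);
  }
  delete(objectNumber, generationNumber) {
    if (generationNumber === 0) {
      const had = this.dense[objectNumber] !== undefined;
      this.dense[objectNumber] = undefined;
      return had;
    }
    return this.sparse.delete(`${objectNumber} ${generationNumber}`);
  }
}

const table = new IndirectObjectTable();
table.assign(12, 0, 'catalog');
table.assign(12, 1, 'older revision');
```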

fast-refs' dense-array cache already short-circuited the LOOKUP side, but on a gen=0 miss it still called through to upstream PDFRef.of, which redundantly populated the upstream Map<string, PDFRef> pool. After fast-indirect-objects shipped, that pool was the last hot Map.set in the heap profile -- ~7 MB of growth-arena churn from ~9 k unique-objectNumber misses on the book. Replace the gen=0 miss path with Object.create(PDFRef.prototype) + manual field init. PDFObject (super) has a no-op constructor and the only fields prototype methods read are objectNumber / generationNumber / tag, so direct construction is safe. CPU: PDFRef.of drops out of process top-15 (~93 ms saved). Heap: set traffic 7.7 MB -> 0.5 MB. The residual 504 KB is PDFName interning's fastCache.set, static-size and harmless. No more materially-hot Map.set in the process-phase heap profile.

After the previous round of allocator-shape shims, parseNumberOrRef was the next-largest row in the process-phase heap profile at ~15 MB -- mostly inlined `new PDFNumber(value)` from the parser's number branch. PDFs reuse a handful of integer values constantly (page indices, /Count, /N, /MediaBox dimensions), and PDFNumber is conceptually immutable, so pooling by value is safe. fast-pdfnumber-pool installs a dense-array cache for non-negative integers in [0, 16384), Map fallback for floats / negatives. Same shape as fast-refs. parseNumberOrRef's row collapses off the top 10; total process-phase heap traffic drops 123 MB -> 107 MB (-13 %). PDFNumber row settles at 0.8 MB (the floor: one instance per unique value). Also lands find-heap-callees.mjs -- the children-of analyzer used to investigate fastParseDictArray's mystery 58 MB self-row (turned out to be recursive parseDict invocations across nesting levels, intrinsic).
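A value pool of the shape described above might look like the following sketch. `Num` stands in for PDFNumber and `pooledNumber` is a hypothetical name; only the range [0, 16384) is taken from the commit message.

```javascript
// Hypothetical sketch of pooling immutable number wrappers by value.
const POOL_LIMIT = 16384;

class Num {
  constructor(value) {
    this.value = value;
  }
}

const densePool = new Array(POOL_LIMIT); // non-negative integers below POOL_LIMIT
const fallbackPool = new Map();          // floats, negatives, large integers

function pooledNumber(value) {
  if (Number.isInteger(value) && value >= 0 && value < POOL_LIMIT) {
    let hit = densePool[value];
    if (hit === undefined) hit = densePool[value] = new Num(value);
    return hit;
  }
  let hit = fallbackPool.get(value);
  if (hit === undefined) {
    hit = new Num(value);
    fallbackPool.set(value, hit);
  }
  return hit;
}
```

Pooling is only safe because the wrapper is treated as immutable; any code that mutated a pooled instance would change every dict slot that shares it.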

Instrumented parseDict shows 261k invocations on the book, 52% of dicts at exactly 5 entries (10 push slots), 80% at 4-5 entries, 96% <= 7 entries, max recursion only 3. The original `const arr = []` + push-grow path was wasting ~85% of fastParseDictArray's 58 MB self-row on FixedArray growth garbage (cap 4 -> 8 -> 16 doubling discards on every 5-entry dict). Allocate the accumulator at `new Array(10)` -- exact fit for the median, 0 growth for 80% of dicts, one growth for the 5% at 7 entries. Direct indexing with len counter; push only on overflow. Heap: total sampled 107 MB -> 92 MB (-14%); fastParseDictArray row 58 MB -> 44 MB (-25%). SCRATCH=10 beat SCRATCH=16 (saved 9.5 MB more) because the cap-16 baseline was itself ~46 MB across 261k calls. Also lands instrument-parsedict.mjs and the --instrument-parsedict flag on measure.mjs for future dict-workload investigations.
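The pre-sized accumulator with a length counter can be sketched as follows. `collectPairs` and `INITIAL_CAPACITY` are hypothetical names; the capacity of 10 slots (5 entries) is the figure from the measurements above.

```javascript
// Hypothetical sketch: fill a pre-sized array by index, push only on overflow.
const INITIAL_CAPACITY = 10; // 5 key/value pairs

function collectPairs(pairs) {
  const arr = new Array(INITIAL_CAPACITY);
  let len = 0; // logical length; arr.length stays at 10 until overflow
  for (const [key, value] of pairs) {
    if (len + 2 <= INITIAL_CAPACITY) {
      arr[len] = key;
      arr[len + 1] = value;
    } else {
      // len is even and already equals arr.length here, so push appends
      // exactly at index len.
      arr.push(key, value);
    }
    len += 2;
  }
  return { arr, len };
}

const small = collectPairs([['A', 1], ['B', 2]]);
const median = collectPairs([['A', 1], ['B', 2], ['C', 3], ['D', 4], ['E', 5]]);
const large = collectPairs([
  ['A', 1], ['B', 2], ['C', 3], ['D', 4], ['E', 5], ['F', 6], ['G', 7],
]);
```

Readers of the array have to honour `len` rather than `arr.length`, because a dict with fewer than five entries leaves unused trailing slots.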

SCRATCH was a leftover name from when I was sketching the parser-wide long-lived buffer approach (which we rejected). The shipped const is just the initial capacity of the per-dict backing array -- not a scratch buffer at all. Rename for clarity. Also softens the notes' "What about a scratch buffer?" follow-up section title to "What about a true scratch buffer?", and notes in the code comment that this is a pre-sized permanent backing array, not scratch.

Each PDFDict would carry a (buf, start, end) view into a parser-wide per-depth shared array that is append-only across all dicts at that depth. Eliminates per-dict array allocation: 261k PDFDict-backing-arrays collapse to 3 shared buffers per parser (max parseDict recursion depth = 3 on the book). Per-depth caps pre-sized to the book's measured workload + slack, so V8 doesn't grow them. Mutations (catalog.set during setOutline) copy-on-write to a private array.

Recursion handling: a single global shared buffer would interleave outer and inner entries when parseObject recurses; per-depth buffers keep outer's range contiguous.

Bug found and fixed during prototyping: `if (!this._dictDepth)` re-init guard fires every time _dictDepth returns to 0 (between top-level dicts), defeating buffer sharing. Use explicit `_dictBufs === undefined` check.

Net win is modest: ~2.5 MB heap reduction vs fast-dict-array (92.13 MB -> 89.68 MB). The buffer-sharing savings (~88 B/dict) are largely offset by the larger PDFDict instance (Object.create + 5 fields vs constructor-inlined + 2).

Superseded by the one-buffer + packed-pointer approach that follows, which shrinks the PDFDict instance back down. Code dropped; narrative kept in perf/notes/08-pdf-lib.md as the thinking that led to one-buffer.

The fast-dict-view PDFDict instance was 5 fields (~96 B); packing the lot into a single 53-bit Number `d` would shrink the instance significantly. Reads via bitwise (fields below bit 32) or arithmetic (Math.floor(d / 2**n) & mask) for higher fields. The PDFContext is a singleton in our pipeline (one PDFDocument.load per process), so the shim keeps it at module level; a second distinct context throws. PDFDict instance shrinks ~96 B -> ~24-32 B. PDFPageLeaf still needs `normalized` + `autoNormalizeCTM` slots (~1.6 k page leaves of 261 k total dicts, small fraction). Heap: 89.7 MB -> 83.7 MB (-6 MB / -7%). GC self-time: 167 ms -> 129 ms (-23%). Cumulative arc from the original Map-backed PDFDict: 152 MB -> 84 MB (-45%). Superseded by the one-buffer PDFDict approach that follows: keeps the "packed into a Number" idea but moves entries into a single per-parser mainBuf, folding bufIdx away and giving a tighter bit layout. Code dropped; narrative kept in perf/notes/08-pdf-lib.md.
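The packing arithmetic can be shown with a two-field example. The layout below (24-bit start in the low bits, 14-bit length above it) is borrowed from the later one-buffer description and is only an assumption for this sketch; the function names are hypothetical.

```javascript
// Hypothetical sketch: two small integers packed into one JS Number.
// Bitwise operators only see the low 32 bits, so fields that reach above
// bit 31 are read with division by a power of two.
const START_BITS = 24;
const POW_24 = 2 ** START_BITS;
const START_MASK = POW_24 - 1;   // low 24 bits
const LENGTH_MASK = 2 ** 14 - 1; // 14 bits

function pack(start, length) {
  // Multiplication instead of <<, because << truncates to 32 bits.
  return length * POW_24 + start;
}

function unpackStart(d) {
  // The start field sits entirely below bit 32, so a bitwise AND is safe.
  return d & START_MASK;
}

function unpackLength(d) {
  // Dividing by 2**24 is exact for integers below 2**53.
  return Math.floor(d / POW_24) & LENGTH_MASK;
}

const d = pack(1234567, 10);
```

With these two fields the packed value never exceeds 2**38 - 1, far below Number.MAX_SAFE_INTEGER, so every step stays in exact integer arithmetic.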

Collapse PDFDict storage into one long-lived mainBuf. Recursion gotcha solved with a two-area split: a small per-parser temp array acts as a stack of parseDict recursion frames; each frame appends to temp, then commits its frame to main in one contiguous append, then pops temp back. Outer's frame stays parked in temp while inner recurses, then resumes intact when inner pops. Owned dicts (factory-created post-parse, COW results) append to mainBuf too.

Mutations: in-place replace for existing keys, COW (copy range to tail + push new pair) for new keys or delete. setOutline's pattern (create-then-recurse-then-set) hits one COW per dict; subsequent sets extend in place at the high-water mark. ~9k entry copies total for the book, negligible.

PDFContext is a singleton in our pipeline; a second distinct context throws. Instance state packs into one 53-bit Number: 24 bit start + 14 bit length + 1 bit owned = 39 bits used, 14 spare.

Heap: total process heap 92 MB -> 66 MB (-28% vs fast-dict-array). Cumulative arc since Map-backed PDFDict: 152 MB -> 66 MB (-57%). GC self-time bumps slightly (one big live array to scan); wall-clock within noise.

Mutually exclusive with --fast-dict-array / --fast-dict-iter / --fast-parse-dict. Production swaps render-book.mjs's fast-dict-array import for fast-dict-onebuf; the legacy shims stay in the tree as A/B baselines. The intermediate fast-dict-view and fast-dict-double prototypes explored on the way to this shape are documented as "explored, didn't ship" sections in perf/notes/08-pdf-lib.md.
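The temp-then-commit pattern is easier to follow in code. The sketch below is a toy: it parses a pre-tokenised nested structure instead of PDF bytes, returns plain `{ start, length }` views instead of packed Numbers, and all names are hypothetical.

```javascript
// Hypothetical sketch of the two-area split.
const main = []; // long-lived: every committed key/value slot
const temp = []; // short-lived: stack of in-progress parse frames

// `node` is a toy pre-tokenised dict: an array of [key, value] pairs where
// a value that is itself an array denotes a nested dict.
function parseDict(node) {
  const frameStart = temp.length; // this frame owns temp[frameStart..]
  for (const [key, value] of node) {
    // Recursion pushes and pops its own frame above ours; our entries
    // stay parked in temp below frameStart + (what we pushed so far).
    const parsed = Array.isArray(value) ? parseDict(value) : value;
    temp.push(key, parsed);
  }
  // Commit the whole frame to main in one contiguous run.
  const start = main.length;
  for (let i = frameStart; i < temp.length; i++) main.push(temp[i]);
  const length = temp.length - frameStart;
  temp.length = frameStart; // pop the frame
  return { start, length };
}

const outer = parseDict([
  ['A', 1],
  ['B', [['C', 2]]],
  ['D', 3],
]);
```

The inner dict commits first, so it occupies main[0..2) and the outer dict's six slots follow as one contiguous range; neither range is interleaved with the other.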

Walks the PDF grammar (indirect objects, dicts, arrays, names, numbers, refs, strings, streams, ObjStms-with-inflate) without instantiating any PDFObject. Counts only: 261k dicts, 2.34M dict slots, 81k arrays, 750k ref appearances, max recursion depth 4.

The measure pass is preparation for a two-pass measure-allocate-work architecture where mainBuf becomes a Float64Array sized exactly to measured demand, eliminating V8's mark traversal of its 2.4M Object-ref slots.

On perf/raw.pdf (39.3 MB Chrome output), measure pass runs in 135 ms (min of 5) vs PDFDocument.load at 1238 ms -- ratio 0.109, ~9x cheaper. Architecture cleared: even at 80% work-pass cost, 135 + 990 = 1125 ms vs current 1238 ms is net win on CPU before any GC reduction.

Wired:
- perf/measure.mjs --dump-raw-pdf <path>: one-time flag that saves Chrome's raw page.pdf() output before pdf-lib processing.
- perf/raw.pdf (gitignored): canonical 39.3 MB input for measure / heap-profile investigations going forward.
- perf/phase0-measure.mjs: the prototype walker. Measurement-only; doesn't ship in any production path.

docs/lib/measure-pass.mjs productionises the Phase 0 walker as a stand-alone library exporting measure(bytes) -> counts. fast-dict-onebuf gains setExpectedDictSlots(slots) that resizes the module-level main backing Array to exact measured demand. perf/measure.mjs gains --measure-pass that wires the two together before PDFDocument.load, with mutex checks against --incremental, --render-only, and the (required) --fast-dict-onebuf.

Structural validation: byte-identical output (1651 pages, 1773 outline nodes, matching titles; 31-byte rawPdf-timestamp jitter on the saved bytes).

A V8 inline-cache gotcha worth capturing: the first cut reassigned the module binding (`let main; main = new Array(N)`) which broke IC slots in every hot closure that read main. Heap profile showed _appendEntries leaking 27 MB and total sampled jumping 65 -> 92 MB, despite the resized array being identical in shape. Pre-filling with arr.fill(null) didn't help (wasn't an element-kind issue). Fix: keep the same Array identity, resize in place via `main.length = N`. Heap regression collapses to +0.14 MB noise. Lesson recorded in notes/08-pdf-lib.md: never rebind a module-level value that hot closures specialise against, even if language semantics allow it -- mutate in place.

Measured cost (paired, production shim set):

    measure-pass:  +60 ms (inline; 135 ms standalone Phase 0)
    load:          unchanged (within noise)
    net process:   +40 ms

Heap: flat. Phase 1 doesn't change what gets allocated, only the initial capacity of the backing Array (which is module-load-time cost, invisible in process-phase profiles).

Behind --measure-pass flag in the harness. NOT yet in docs/render-book.mjs's production import chain -- no current consumer wins anything back from the measured counts. The flag exists so a later commit can flip it into production once the architecture has another consumer.
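The "mutate in place, never rebind" lesson can be reduced to a few lines. This is only a sketch of the pattern: `setExpectedSlots` and `append` are hypothetical stand-ins, and the snippet makes no claim about how V8 optimises it beyond what the measurements above report.

```javascript
// Hypothetical sketch: the backing array is a const binding, so the only way
// to pre-size it is to resize the same Array object in place.
const main = [];
let mainLen = 0; // logical high-water mark, independent of main.length

function setExpectedSlots(slots) {
  // Grow the existing array; closures that captured `main` keep reading
  // the very same object.
  if (slots > main.length) main.length = slots;
}

function append(value) {
  main[mainLen++] = value;
}

const identityBefore = main;
setExpectedSlots(100);
append('first');
```

Declaring the binding `const` turns the rule into something the language enforces: an accidental `main = new Array(N)` would throw instead of silently changing the object that hot code reads.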

The owned/shared flag at bit 38 only gated set's append path: "extend in place at HWM only if owned." Re-reading the safety argument shows this is over-cautious. Each parseDict commits a contiguous frame to main and mainLen advances past it; no two PDFDict instances share slots. So if a dict's range satisfies start + length === mainLen, the slots past mainLen are free regardless of how the range was created. The owned/shared distinction the bit encoded doesn't correspond to anything the safety check needs.

Changes:
- pack(start, length): third arg gone; no POW_38 OR-in.
- _owned and POW_38: deleted.
- _cow: collapses to one branch (was two paths that differed only in the owned-at-HWM early return).
- set: gate becomes just `start + length !== mainLen`.
- _makeFromRange: owned param gone.
- _ownedFromArray renamed _makeFromAppend for accuracy.
- Bit 38 is now spare; spare grows from 14 to 15 bits.

Net behavioural change: shared dicts that still abut the HWM at first .set now extend in place instead of COWing -- ~5-10 slot copies avoided per such mutation. Tiny win in the right direction.

Byte-identical output (1651 pages, 1773 outline nodes, matching titles). Heap flat: 65.34 MB vs 65.27 MB baseline, within noise. Top frames in the heap profile are structurally the same.
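The at-HWM rule can be sketched on a toy buffer. The real shim keeps (start, length) packed in one Number; this sketch uses plain fields for readability, and `makeDict` / `set` are hypothetical names.

```javascript
// Hypothetical sketch: extend in place when the dict's range ends at the
// high-water mark, otherwise copy the range to the tail first (COW).
const main = [];

function makeDict(pairs) {
  const start = main.length;
  for (const [key, value] of pairs) main.push(key, value);
  return { start, length: pairs.length * 2 };
}

function set(dict, key, value) {
  const end = dict.start + dict.length;
  for (let i = dict.start; i < end; i += 2) {
    if (main[i] === key) {
      main[i + 1] = value; // existing key: replace in place
      return;
    }
  }
  if (end !== main.length) {
    // Not at the high-water mark: slots after `end` belong to someone else,
    // so relocate this dict's range to the tail.
    const newStart = main.length;
    for (let i = dict.start; i < end; i++) main.push(main[i]);
    dict.start = newStart;
  }
  main.push(key, value); // now guaranteed to abut the range
  dict.length += 2;
}

const a = makeDict([['A', 1]]);
const b = makeDict([['B', 2]]);
set(b, 'C', 3); // b ends at the HWM: extends in place
set(a, 'D', 4); // a does not: copied to the tail, then extended
set(a, 'A', 9); // existing key: replaced in place
```

Note that the check never asks how a range was created, only where it ends, which is the observation that made the owned bit redundant.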

Measurement-only tooling for Phase 2 design. fast-dict-onebuf exports `main` and a `getMainLen()` getter so external consumers can inspect the buffer. perf/instrument-slot-types.mjs walks main[0..mainLen) after setOutline and classifies each slot by PDFObject subtype, printing key/value counts and percentages. perf/measure.mjs gains --instrument-slot-types that loads the module and invokes the classifier (requires --fast-dict-onebuf; not compatible with --incremental / --render-only).

Distribution on the book (production shim set + --measure-pass): total slots = 2 358 630, keys = 1 179 315, values = 1 179 315.

    type            keys      key%      values    value%    total%
    -----------------------------------------------------------------
    PDFName         1179315   100.00%   493256    41.83%    70.91%
    PDFRef          0         0.00%     435217    36.90%    18.45%
    PDFNumber       0         0.00%     162325    13.76%    6.88%
    PDFArray        0         0.00%     79468     6.74%     3.37%
    PDFDict         0         0.00%     5660      0.48%     0.24%
    PDFHexString    0         0.00%     1776      0.15%     0.08%
    PDFString       0         0.00%     1601      0.14%     0.07%
    PDFBool.True    0         0.00%     12        0.00%     0.0005%
    PDFBool.False   0         0.00%     0         0.00%     0
    PDFNull         0         0.00%     0         0.00%     0

Key findings:
(1) keys are 100% PDFName -- the even/odd invariant holds.
(2) Four big pools (Name, Ref, Number, Dict) cover 96.4% of all slots; encoding them directly into the Float64 mainBuf collapses ~96% of slot-mark traversals.
(3) Side-pool fallback for unpooled types (Array, String, HexString) is ~3.5% -- ~82 800 slots that V8 would still mark, vs ~2.34M today.
(4) Nested PDFDicts as slot values are only 5 660 -- most dicts are referenced via Ref rather than embedded.
(5) Bool/Null/RawStream in dict slots are essentially zero; tag-only encoding works.

Classification cost: 39ms (single pass over 2.36M slots).
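A classifier of this kind is a single pass over the buffer. The sketch below is not perf/instrument-slot-types.mjs: it classifies by constructor name on toy classes, and `classifySlots` is a hypothetical name.

```javascript
// Hypothetical sketch: count slot types in an alternating key/value buffer.
// Even indices are keys, odd indices are values.
function classifySlots(buf, len) {
  const counts = new Map();
  for (let i = 0; i < len; i++) {
    const v = buf[i];
    const name = v === null || v === undefined ? 'empty' : v.constructor.name;
    let row = counts.get(name);
    if (!row) counts.set(name, (row = { keys: 0, values: 0 }));
    if (i % 2 === 0) row.keys++;
    else row.values++;
  }
  return counts;
}

// Toy stand-ins for PDFName / PDFRef / PDFNumber.
class Name {}
class Ref {}
class Num {}

const buf = [new Name(), new Ref(), new Name(), new Num(), new Name(), new Name()];
const counts = classifySlots(buf, buf.length);
for (const [type, row] of counts) {
  const total = (100 * (row.keys + row.values)) / buf.length;
  console.log(type.padEnd(8), row.keys, row.values, total.toFixed(2) + '%');
}
```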

The next architectural step from Phase 1 -- replace the Object[] mainBuf with a Float64Array, encode every entry (key and value alike) as a 4-bit type tag + 49-bit pool id / payload. Subsumes fast-refs and fast-pdfnumber-pool by owning PDFRef.of and PDFNumber.of with built-in pool-id assignment; adds new pools for PDFArray (sequential id), PDFString and PDFHexString (value-dedup since they're immutable). PDFDict slots encode the existing 38-bit (start, length) payload directly.

A trap worth recording: the first cut eagerly cached every parse-created PDFDict in dictByPayload so decodeValue(TAG_DICT) would return the same instance. That writes 261k Map entries during parse; total heap went 65 -> 92 MB. Fix: lazy materialization. Top-level dicts (226k) live in indirectObjects and are never decoded via TAG_DICT; only nested dicts (~5660) are. Caching on first access caps the Map at ~5660 entries.

Measured result on faraday: wash.

    process wall        1.16 s -> 1.18 s   (~+20 ms, noise)
    GC self-time        151 ms -> 149 ms   (~0 ms)
    heap allocation     65 MB  -> 68 MB    (+3 MB from new pool Maps)
    marked main slots   2.34 M -> 0        (architectural win, no $$)

The slot-mark-cost win is real but mainBuf wasn't the bottleneck -- pointer-array marks are fast in V8. Encoding overhead roughly cancels the savings.

Code dropped. Faraday kept it as opt-in foundation for Phase 3, but Phase 3 also doesn't ship, so the dependency chain doesn't earn its keep on staging. Design notes preserved in perf/notes/08-pdf-lib.md as the takeaway worth keeping.
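The 4-bit tag + 49-bit payload encoding fits exactly into the 53 bits a double can hold as an exact integer. A minimal sketch, with hypothetical tag values and function names:

```javascript
// Hypothetical sketch: tag in the top 4 bits, payload in the low 49 bits,
// stored as an exact integer in a Float64Array slot.
const PAYLOAD_BITS = 49;
const POW_49 = 2 ** PAYLOAD_BITS;

const TAG_NAME = 1; // illustrative tag numbering, not the shim's
const TAG_REF = 2;

function encode(tag, payload) {
  return tag * POW_49 + payload;
}
function tagOf(slot) {
  return Math.floor(slot / POW_49);
}
function payloadOf(slot) {
  return slot % POW_49;
}

const slots = new Float64Array(2);
slots[0] = encode(TAG_NAME, 17);    // e.g. pool id 17 in a name pool
slots[1] = encode(TAG_REF, 226000); // e.g. object number 226000
```

The largest encodable value, tag 15 with an all-ones payload, is exactly Number.MAX_SAFE_INTEGER, so there is no headroom for a fifth tag bit or a fiftieth payload bit.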

Mirrors Phase 2's PDFDict shape onto PDFArray: each instance becomes a view into a shared arrayBuf Float64Array via this.d = packed (start, length), with the same temp-then-commit parseArray pattern and the same 4-bit-tag + 49-bit-payload slot encoding. Bit budget: 24 + 15 = 39 bits per array (one more length bit than dict).

Measured result on faraday: heap win + CPU regression.

    process duration   1.09 s  -> 1.45 s   (+360 ms, +33 %)
    GC self-time       149 ms  -> 144 ms   (flat)
    heap sampled       65 MB   -> 58 MB    (-7.6 MB, -12 %)
    parseArray row     19.6 MB -> 0        (out of top 10)
    structural         byte-identical (1651 pages, 1773 outline)

The heap win is real: 79k PDFArrays stop allocating backing arrays. The CPU regression is mostly per-slot decode during save -- PDFDict.copyBytesInto + PDFArray.copyBytesInto together iterate ~3M slots, each calling decodeValue (10-case switch + pool lookup). V8 doesn't inline decodeValue across the prototype boundary; ~100 ns x 3M = ~300 ms.

Code dropped (depended on Phase 2's fast-dict-encoded.mjs, also not on staging). Design notes preserved in perf/notes/08-pdf-lib.md; the follow-up Phase 3β attempts to recover most of the 300 ms by hand-inlining the common decode cases at the hot call sites.

…t ship.

The Phase 3 CPU regression was almost entirely per-slot decode during save -- PDFDict.copyBytesInto + sizeInBytes + PDFArray.copyBytesInto + sizeInBytes iterate ~3M slots, each calling decodeValue (10-case switch + pool lookup). V8 doesn't inline across the prototype-method boundary; ~100 ns x 3M ~= +300 ms.

(β) hand-inlines decodeValue's switch into all four hot methods. The switch body is copy-pasted verbatim into each loop, giving V8 a monomorphic call site per case branch.

Measured deltas vs Phase 3 (pre-inline) on the book:

    (garbage collector)         144 ms  -> 130 ms    (-14 ms win)
    PDFObjectParser.parseName   106 ms  -> 70 ms     (-36 ms win, surprise)
    PDFDict.copyBytesInto       57 ms   -> 49 ms     (-8 ms)
    fastParseDictEncoded        59 ms   -> 63 ms     (+4 ms)
    heap sampled                57.8 MB ~  58.0 MB   (flat)
    structural                  byte-identical (1651 pages, 1773 outline)

parseName's drop is a downstream effect of V8 re-optimizing the call graph once the hot copyBytesInto / sizeInBytes paths became monomorphic per case branch.

Net of full Phase 2 + 3 + β vs P1 baseline (fast-dict-onebuf):

    heap           65.4 MB -> 58.0 MB   (-7.4 MB, -11 %)
    GC self-time   149 ms  -> 130 ms    (-19 ms, -13 %)
    CPU residual   ~+200 ms across many frames (noisy)

Architectural conclusion: Float64Array encoded storage works correctly and delivers a real heap+GC win, but the per-slot encoding overhead exceeds the slot-mark savings. V8 marks pointer arrays faster than assumed (~10-20 ms for 2.4M slots, not 100+). The original Object[] polymorphic .copyBytesInto() was actually fine; replacing it with explicit switch dispatch helps GC and parseName but hurts dict-side hot loops.

Code dropped (depended on Phase 2 / Phase 3 architecture, also not on staging). Notes preserved in perf/notes/08-pdf-lib.md as the endpoint of this storage-shape exploration. The next move on the same theme is the much narrower "one-buffer for PDFArray" (fast-array-onebuf), which mirrors fast-dict-onebuf's shape directly and does ship.

Wires docs/lib/measure-pass.mjs in front of PDFDocument.load in docs/render-book.mjs. The no-allocate walker counts dictSlots once, fast-dict-onebuf pre-sizes its main Array to the exact count (main.length = N), and V8 growth resizes during load go away.

Net wall-clock on the book is ~+40 ms (walker ~60 ms, load saves ~20). That's the smallest of the four Phases evaluated and the only one whose tradeoff is acceptable for shipping: Phase 2 is a regression, Phase 3 / 3β recover most of it but only for a ~7 MB heap win that doesn't justify the CPU cost. Phase 1's bound is mostly insurance -- mainBuf was over-allocating by ~60 K slots out of 2.4 M -- but it lays the plumbing for any future shape change to ship without re-doing the wiring.

Also adds --measure-pass to both canonical commands in perf/README.md and a flag-rationale entry parallel to fast-dict-onebuf's.

Mirror of fast-dict-onebuf's strategy applied to PDFArray. Every committed element lives in a single append-only `arrayMain` JS Array, kept for the document's lifetime. Each PDFArray instance is a view via packed (start, length) in `d`. Per-instance `this.array = []` allocation goes away; ~79 k PDFArrays stop allocating per-instance backing arrays + grow doublings.

Storage is a plain heterogeneous JS Array -- slots hold the original PDFObject references, reads are `arrayMain[start + i]` with no decode. This is the explored-but-didn't-ship Phase 3 shape minus the Float64Array encoding (which cost ~300 ms of decodeValue dispatch on save's copyBytesInto across ~3 M slots). The plain-reference shape skips that entirely.

Parser uses a per-parser _arrayTemp + length cursor as a recursion stack, parallel to fast-dict-onebuf's _dictTemp; each parseArray invocation pushes onto temp, commits its frame to arrayMain in one contiguous append, and pops temp back. Dict / array temps are independent so cross-recursion is fine.

Mutations: in-place replace for set, in-place extend at HWM for push, COW for insert / remove / push-not-at-HWM. Same at-HWM safety logic as fast-dict-onebuf; no owned bit needed.

Bit budget: 24-bit start (16 M slots) + 16-bit length (65 536 elements, max observed ~25 k on the book) = 40 bits, well within Number.MAX_SAFE_INTEGER.

Singleton context is duplicated (10 lines) rather than shared with fast-dict-onebuf -- each shim stays independently injectable.

Production wiring:
- docs/render-book.mjs: import setExpectedArraySlots; call after measure-pass before PDFDocument.load.
- perf/measure.mjs: --fast-array-onebuf flag. Composes with --fast-dict-onebuf. --measure-pass also drives setExpectedArraySlots when both shims are on.
- perf/README.md: --fast-array-onebuf in both canonical commands, flag-rationale entry, run.bat row, What-shipped + Investigation-log.

Heap impact (process phase, 512 B sampling, fast-array-onebuf on vs the immediate predecessor baseline):

    total sampled    65.6 MB -> 51.9 MB   (-13.7 MB, -20.8 %)
    parseArray row   19.6 MB -> 0         (out of top 15)
    new shim row     -       -> 4.2 MB    (PDFArray wrappers)

CPU impact (process wall, pinned 0x5500 / High, no profiler, 3 paired runs):

    P1 only     median 1.07 s, mean 1.09 s
    P1 + this   median 1.02 s, mean 1.01 s
    Δ mean      -0.08 s (this shim slightly faster, within noise)

The CPU regression that showed up under --cpu-profile-process was profiler-induced noise; gone once we pin and drop the sampler.

Cumulative heap arc since Map-backed PDFDict: 152 MB -> 52 MB (-66 %). The endpoint of the dict + array allocator refactor.

Upstream caches `<obj> <gen> R` on each PDFRef so toString / sizeInBytes / copyBytesInto can read it back. After fast-array-onebuf shipped, the heap profile showed PDFParser.parseIndirectObjectHeader at 13.7 MB / 25 % of total -- attribution chain (via perf/find-heap-callers.mjs):

    parseIndirectObjectHeader
      → skipJibberish (14.2 MB)
        → matchIndirectObjectHeader (try/catch wrapper)
          → parseIndirectObjectHeader
            → fastOf

skipJibberish runs after every successful indirect object parse and speculatively calls matchIndirectObjectHeader to detect the next `N M obj` header. On valid PDFs it always succeeds. fastOf fires once per indirect-object boundary, populating the dense-array cache; the subsequent "real" parseIndirectObject is a cache hit. V8 inlines fastOf at this call site (small + hot from speculation), so the attribution lands on the caller -- 13.7 MB of which was the tag-string churn (`objectNumber + ' 0 R'`): V8 builds 1-2 intermediate concat strings + the final ~25-35 B tag, ~150 k times.

Eliminating the field collapses both rows:

    parseIndirectObjectHeader   13.7 MB -> 9.3 MB    (-4.3 MB)
    fastOf (refs)               7.7 MB  -> 4.8 MB    (-2.9 MB)
    total sampled               51.9 MB -> 45.2 MB   (-6.7 MB, -13 %)
    parseArray row              gone (already collapsed by fast-array-onebuf)

The remaining 9.3 MB at parseIndirectObjectHeader and 4.8 MB at fastOf are the PDFRef instances themselves (Object.create + objectNumber + generationNumber fields, ~32-48 B × ~150 k) plus attribution leakage from V8's inlining. Hard floor without dropping per-PDFRef wrappers entirely.

Prototype methods now compute results from objectNumber / generationNumber directly:
- copyBytesInto: writes digits straight into the output buffer with a no-allocation _writeUint helper (divide-and-write-backwards into the caller's buffer). No copyStringIntoBuffer call.
- sizeInBytes: returns digitCount(obj) + digitCount(gen) + 3.
- toString: builds on demand. Debug only, no caching needed.

Both gen=0 (no tag set) and gen!=0 (tag set by upstream's constructor but ignored) work; the gen!=0 path's tag string is allocated-then-wasted (~18 % of refs, ~1 MB), not worth patching the constructor to avoid.

CPU impact (pinned 0x5500 / High, no profiler, 4 runs each):

    with-tag   median 1.045 s, mean 1.045 s
    tagless    median 1.030 s, mean 1.030 s
    Δ          ~15 ms tagless faster (in the noise but trending)

Output PDF is byte-identical modulo /CreationDate + /ModDate timestamps (verified by inflating + diffing all 453 ObjStm streams).
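The divide-and-write-backwards idea behind the tagless serialisation can be sketched as below. These are stand-alone functions with hypothetical names, not the patched PDFRef prototype methods.

```javascript
// Hypothetical sketch: serialise "<obj> <gen> R" without building a string.
function digitCount(n) {
  let count = 1;
  while (n >= 10) {
    n = Math.floor(n / 10);
    count++;
  }
  return count;
}

// Writes the decimal digits of n into buffer starting at offset, filling
// from the last digit backwards. Returns the number of bytes written.
function writeUint(n, buffer, offset) {
  const digits = digitCount(n);
  for (let i = offset + digits - 1; i >= offset; i--) {
    buffer[i] = 48 + (n % 10); // 48 is ASCII '0'
    n = Math.floor(n / 10);
  }
  return digits;
}

function refSizeInBytes(objectNumber, generationNumber) {
  // Two spaces plus the trailing 'R' account for the + 3.
  return digitCount(objectNumber) + digitCount(generationNumber) + 3;
}

function copyRefInto(objectNumber, generationNumber, buffer, offset) {
  const start = offset;
  offset += writeUint(objectNumber, buffer, offset);
  buffer[offset++] = 32; // ' '
  offset += writeUint(generationNumber, buffer, offset);
  buffer[offset++] = 32; // ' '
  buffer[offset++] = 82; // 'R'
  return offset - start;
}

const out = new Uint8Array(16);
const written = copyRefInto(1234, 0, out, 0);
const text = String.fromCharCode(...out.subarray(0, written));
```

Here `text` is "1234 0 R" and `written` equals `refSizeInBytes(1234, 0)`, which is the invariant a save path relies on when it sizes the output buffer before filling it.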

On valid PDFs, the byte after each successful parseIndirectObject + skipWhitespaceAndComments is almost always a digit -- the start of the next `N M obj` header. skipJibberish only exists to recover from invalid PDFs that wedge garbage between indirect objects, but its hot path runs unconditionally: ~150 k calls per load on the book, each speculatively trying matchKeyword(xref/trailer/startxref) (all fail on a digit) and then matchIndirectObjectHeader (a try/catch around parseIndirectObjectHeader → parseRawInt × 2 → matchKeyword('obj') → fastOf round-trip). The speculation succeeds every time, the cursor gets rewound, and the outer while loop's IsDigit check confirms what the speculation already proved.

Short-circuit when the cursor is on a digit; fall through to skipJibberish on anything else (xref / trailer / startxref keyword starts, or real jibberish between indirect-object sections). The once-per-section skipJibberish in parseDocumentSection (after maybeParseTrailer) is unaffected -- it handles boundaries between PDF revisions / EOF where stray bytes are spec-legal.

Wall-clock impact (pinned 0x5500 / High, no profiler, 4 paired runs):

    without fast-path   median 1.07 s,  mean 1.053 s
    with fast-path      median 0.995 s, mean 0.985 s
    Δ                   ~67 ms faster (mean), ~6 % of process phase

Phase breakdown isolates the win to load (mean 0.518 → 0.455 s, -62 ms); save is flat as expected (fast-path is load-side only).

Heap unchanged (0 MB delta, as predicted) -- the PDFRef instances the speculation allocated were already attribution-shifted to the real parseIndirectObject's cache miss, not new allocations.

Output PDF byte-identical to the pre-patch baseline (verified by inflating + diffing all 453 ObjStm streams modulo timestamps).
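The short-circuit amounts to one byte test in front of the recovery routine. In this sketch `skipBetweenObjects` and the `slowSkip` callback are hypothetical stand-ins for the patched call site and for skipJibberish.

```javascript
// Hypothetical sketch: skip the speculative recovery scan when the cursor
// already sits on a digit (the start of the next "N M obj" header).
const isDigit = (byte) => byte >= 0x30 && byte <= 0x39;

function skipBetweenObjects(bytes, offset, slowSkip) {
  if (offset < bytes.length && isDigit(bytes[offset])) return offset; // fast path
  return slowSkip(bytes, offset); // keywords or real garbage: full recovery
}

// Toy slow path that just counts how often it is reached.
let slowCalls = 0;
const countingSlowSkip = (bytes, offset) => {
  slowCalls++;
  return offset;
};

const toBytes = (s) => Uint8Array.from(s, (c) => c.charCodeAt(0));
const onHeader = skipBetweenObjects(toBytes('12 0 obj'), 0, countingSlowSkip);
const onKeyword = skipBetweenObjects(toBytes('xref'), 0, countingSlowSkip);
```

Only the second call reaches the slow path; the first returns immediately because the byte under the cursor is a digit.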

… process).

fast-refs builds PDFRef instances via `Object.create(PDFRef.prototype) + fresh.objectNumber = ... + fresh.generationNumber = ...`. V8 transitions the hidden class on each property write and routes the result through the slow-property path. Empirically on the book that's ~60 B per pooled PDFRef, vs ~31 B for PDFName (built via `new PDFName(...)` -- a real constructor with stable hidden class from the first instance).

This shim swaps `Object.create + writes` for a plain function used as a constructor that sets both fields in one shot. Aliasing `_FastRef.prototype = PDFRef.prototype` keeps `instanceof PDFRef` satisfied and resolves all prototype methods on the shared prototype (no extra proto-chain hop). gen != 0 still falls back to upstream PDFRef.of's Map-based pool (rare on freshly-parsed PDFs).

Measured on the book (paired heap + cpu profile, --fast-refs vs --fast-refs-class with the rest of the production shim set on):

    Heap (sampled total)            45.26 MB -> 41.39 MB   (-3.87 MB, -8.5 %)
    fastOf / fastClassOf row        4 696 KB -> 3 435 KB   (-1 261 KB)
    create (builtin)                3 379 KB -> 2 627 KB   (-752 KB)
    parseIndirectObjectHeader row   9 115 KB -> 7 435 KB   (-1 680 KB)

Per-PDFRef savings: ~16 B/instance × 226 k unique = ~3.7 MB. Not the full 30 B-to-PDFName-floor (PDFRef carries 2 fields vs PDFName's 1), but a clean win and the construction-style change applies symmetrically to the other Object.create-built shapes (fast-dict-onebuf._makeFromRange, fast-array-onebuf._makeFromRange) for the next round.

    Process wall-clock          1.13 s -> 0.99 s   (-140 ms, -12 %)
      load                      0.52 s -> 0.47 s
      save                      0.51 s -> 0.44 s
    fastOf (PDFRef) self-time   28 ms  -> (out of top 15)

GC self-time barely moved (87 ms -> 82 ms), consistent with the allocation-rate drop being modest relative to mark-cost (the live fast-dict-onebuf mainBuf still dominates the GC bill).

fast-refs.mjs stays in the tree as an A/B baseline. measure.mjs mutex-checks --fast-refs and --fast-refs-class so the wrong one can't be loaded silently. render-book.mjs swaps the import: production runs through fast-refs-class now.
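The constructor-shape change can be sketched with a stand-in class. `Ref` below plays the role of the upstream class and `FastRef` / `fastOf` are hypothetical names; the sketch shows only the prototype aliasing, not the measured memory effect.

```javascript
// Stand-in for the upstream class whose constructor the shim avoids calling.
class Ref {
  constructor(objectNumber, generationNumber) {
    this.objectNumber = objectNumber;
    this.generationNumber = generationNumber;
  }
  describe() {
    return `${this.objectNumber} ${this.generationNumber} R`;
  }
}

// Plain function used as a constructor: both fields are assigned in its body,
// in a fixed order, instead of being added to an Object.create() result.
function FastRef(objectNumber) {
  this.objectNumber = objectNumber;
  this.generationNumber = 0;
}
// Alias the prototype so instances are indistinguishable by instanceof and
// resolve methods directly on Ref.prototype.
FastRef.prototype = Ref.prototype;

const pool = []; // dense cache for generation-0 refs, indexed by objectNumber
function fastOf(objectNumber) {
  let ref = pool[objectNumber];
  if (ref === undefined) ref = pool[objectNumber] = new FastRef(objectNumber);
  return ref;
}

const ref = fastOf(7);
```

Because the prototype object is shared rather than inherited from, `ref.constructor` still reports `Ref`, which keeps any code that inspects the constructor unaffected.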

Same shape change fast-refs-class applied to PDFRef, now applied to the PDFDict factory paths. Replaces `Object.create(ProtoClass.prototype) + pd.d = ... [+ pd.normalized = false + pd.autoNormalizeCTM = true]` in _makeFromRange and clone with one plain-function constructor per subclass (`_FastDict`, `_FastCatalog`, `_FastPageTree`, `_FastPageLeaf`), each with the right field assignments in its body and its prototype aliased to the upstream prototype so instanceof + method dispatch are unchanged. Unknown subclasses fall back to the original Object.create path (defensive; nothing in our pipeline hits it).

Measured on the book (paired heap + cpu profile, fast-refs-class baseline vs + this change):

    Heap (sampled total)              41.39 MB  -> 35.41 MB    (-5.98 MB, -14.4 %)
    _makeFromRange (dict)             16 484 KB -> 11 404 KB   (-5 080 KB)
    create (builtin)                  2 627 KB  -> 921 KB      (-1 706 KB)
    _FastDict (new attribution row)   -         -> 621 KB

Per-PDFDict saving: ~20 B/instance × 260 k = ~5.2 MB. Matches the delta on the row plus the builtin's drop minus the new constructor-frame attribution.

Cumulative since fast-refs-class: total sampled 45.26 MB -> 35.41 MB = -9.85 MB (-22 %) over two shape-change commits.

CPU is roughly flat (process 0.99 s -> 1.03 s under cpu profile, within noise). GC self-time +18 ms (82 -> 101 ms), consistent with the existing fast-dict-onebuf trade-off documented in perf/README.md: the dominant GC cost is the live mainBuf scan, not allocation rate, so cutting allocation doesn't cut mark time. The allocation-rate reduction still matters for sustained-load memory pressure even when it doesn't move single-shot wall-clock.

Output PDF byte-identical modulo /CreationDate + /ModDate timestamps (no content path touched, only the JS object shape used to wrap the parsed dict range).

…130 ms process).

Same shape change applied to PDFArray's factory paths -- replace `Object.create(PDFArray.prototype) + pa.d = ...` in _makeFromRange and clone with a `_FastArray` plain-function constructor, prototype aliased to PDFArray.prototype. No subclass dispatch needed (PDFArray has none in pdf-lib, unlike PDFDict).

Measured on the book (paired heap + cpu profile, prior commit's dict-class baseline vs + this change):

    Heap (sampled total)        35.41 MB  -> 33.68 MB          (-1.73 MB, -4.9 %)
    fastParseArrayOneBuf row    4 372 KB  -> 3 334 KB          (-1 038 KB)
    create (builtin)            921 KB    -> (out of top 15)   -921 KB
    Process wall-clock          1.03 s    -> 0.90 s            (-130 ms, -13 %)
    GC self-time                100.90 ms -> 58.69 ms          (-42 ms)

Per-PDFArray saving: ~22 B/instance × ~80 k = ~1.7 MB. Matches the row delta + builtin drop.

Surprise win on GC + wall-clock: cuts GC self-time 42 ms despite a much smaller allocation drop than fast-refs-class or fast-dict-onebuf. The likely reason is that with all three shape-changes in place, V8 sees fully monomorphic call sites for PDFRef / PDFDict / PDFArray construction and method dispatch -- before the array change there was still one slow-property shape in the mix dragging IC perf. Confirmed by the cumulative process-time arc:

    baseline (fast-refs)                1.13 s    87 ms GC
    + fast-refs-class                   0.99 s    82 ms GC
    + fast-dict-onebuf class shape      1.03 s   101 ms GC
    + fast-array-onebuf class shape     0.90 s    59 ms GC

The dict-only state had a slight CPU regression (+40 ms vs refs-class) that the array change undid and then some. Ship the combo, not just the two big-row ones.

Cumulative across the three commits in this round (baseline -> array-class):

    Process wall-clock   1.13 s   -> 0.90 s     (-230 ms, -20 %)
    Total sampled heap   45.26 MB -> 33.68 MB   (-11.58 MB, -25.6 %)
    GC self-time         87 ms    -> 59 ms      (-32 %)

… round.

The three commits that just landed (fast-refs-class, fast-dict-onebuf class shape, fast-array-onebuf class shape) did the same shape change to PDFRef, PDFDict, and PDFArray's factory paths -- swap `Object.create(proto) + property writes` for a plain-function constructor whose body assigns all fields in one shot, with the prototype aliased so `instanceof` + method dispatch stay unchanged. V8 gives the new instances a stable hidden class from the first one and per-instance heap cost drops from ~60 B to ~44 B.

Each commit's narrative landed in notes/08-pdf-lib.md per the staging convention (perf/README.md stays light/operational; the per-shim story lives in notes/). This commit folds in the closing context that ties the round together:

- Per-instance savings table across all three wrappers in one view.
- Investigation aside: how the `parseIndirectObjectHeader` 9 MB heap row was a V8 inlining-attribution artifact for fastOf's allocations downstream, confirmed via a `node --no-turbo-inlining` paired run. The hand-inlined `fast-pioh.mjs` attempt was deleted after proving the negative; the call-counting instrumentation lives in perf/instrument-pioh.mjs (arrives in the next commit).
- Caveats: singleton subclass set in fast-dict-onebuf's dispatch, shared prototype semantics for `instanceof` / method dispatch, residual polymorphism on the gen != 0 PDFRef fallback.
- Top-5 heap rows post-round + the next-step menu (wrapper elimination vs targeted smaller shrinks). The big rows are now at the per-instance floor for V8 objects with 1-2 inline fields.

No code changes.

…shape round).

Both scripts informed the constructor-shape PDFRef / PDFDict / PDFArray work that landed earlier in this round; capture them in the tree so future "is row X a real hot spot or a labelling artifact" investigations can reuse the pattern.

`instrument-pioh.mjs` -- wraps PDFParser.parseIndirectObjectHeader + matchIndirectObjectHeader with counters, reports per-load call counts and kept-heap delta. Built to answer the prerequisites before committing to inline parseIndirectObjectHeader (was the function actually a hot spot? was fast-sync-load's digit short-circuit firing? was speculation throwing?). Output on the book: 226 k pioh calls, 0 throws, 0 mih calls, ~35 MB kept heap. The 9 MB self-attribution turned out to be V8 inlining fastOf into the caller frame -- the script pointed to the correct attack surface (wrapper construction, not the parser body).

`instrument-objclasses.mjs` -- two views of "how many PDF* wrappers do we build per load":

- `.of()` call counts for the pooled / factory-method classes (PDFRef 1.43 M calls / 226 k unique, PDFName 1.68 M / 4.8 k, PDFNumber 284 k / 16 k, PDFString 7.4 k, PDFRawStream 2 k);
- post-load walk of PDFContext.enumerateIndirectObjects() bumping per-runtime-class counters for the top-level shapes (PDFDict 221 k top-level / ~261 k incl. nested; 1 PDFCatalog; 238 PDFPageTree; 1 651 PDFPageLeaf).

Used to size the constructor-shape attack: confirmed PDFRef + PDFDict + PDFArray dominate the per-instance × instance-count product, so they were the right three to convert.

Both scripts updated to import fast-refs-class (current production) rather than the older fast-refs.mjs (now A/B baseline) so the numbers reflect production. Also adds README entries under "What's in this folder", next to the existing pdf-lib-side standalone harnesses (profile-load.mjs, profile-roundtrip.mjs).
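The call-counting approach can be sketched roughly as follows. This is not the contents of `instrument-pioh.mjs`; `countCalls` and `ToyParser` are hypothetical names, and the sketch only shows the general idea of wrapping a prototype method with call and throw counters.

```javascript
// Wrap proto[methodName] with a counter for calls and thrown exceptions.
// Returns a function that restores the original method.
function countCalls(proto, methodName, counters) {
  const original = proto[methodName];
  counters[methodName] = { calls: 0, throws: 0 };
  proto[methodName] = function (...args) {
    counters[methodName].calls++;
    try {
      return original.apply(this, args);
    } catch (err) {
      counters[methodName].throws++;
      throw err; // keep the original behaviour visible to callers
    }
  };
  return () => {
    proto[methodName] = original;
  };
}

// Hypothetical toy class standing in for a real parser.
class ToyParser {
  parseHeader(text) {
    if (!/^\d+ \d+ obj/.test(text)) throw new Error('not an object header');
    return text.split(' ').slice(0, 2).map(Number);
  }
}

const counters = {};
const restore = countCalls(ToyParser.prototype, 'parseHeader', counters);
const parser = new ToyParser();
parser.parseHeader('12 0 obj');
parser.parseHeader('13 0 obj');
try {
  parser.parseHeader('xref');
} catch {
  // expected: counted as a throw
}
restore();

console.log(counters.parseHeader);
```

Counting at the prototype keeps the instrumentation out of the code under test, and restoring the original afterwards avoids skewing any profile taken later in the same process.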

…ss, -9 %).

PDFObjectParser.prototype.parseName fires 1.68 M times per load on the book and the call density was the #1 row in the process CPU profile after the constructor-shape round closed (PDFObjectParser.parseName @ ~87 ms self + fastOf @ ~57 ms callee = ~144 ms combined, ~16 % of process). 4 787 of those calls are unique; the other 99.7 % hit the same handful of dict keys (Type, Length, Pages, MediaBox, ...) over and over. The per-call work -- build a string via `name += charFromCode(byte)` then hand it to PDFName.of's string-keyed Map -- is pure overhead on the hot path: the answer was already cached, we just kept rebuilding the key.

A failed first attempt (the v1 of this shim, not committed) tried to keep the cons-string accumulator but skip the per-byte this.bytes.peek/.next/.done method dispatch. CPU didn't move: V8 had already optimised the cons-string path well and the saved method-call cost just shifted to attribution on the callers (fastParseDictOneBuf / fastParseObject) under inlining. A second failed sketch built the lookup string via `String.fromCharCode.apply(null, buf.subarray(start, idx))` and was SLOWER than upstream (~123 ms vs ~87 ms): .apply on a typed array is a deopt path in V8.

This shim attacks the real surface: avoid producing the lookup string at all on the hot path. Scan bytes with direct buffer access, compute a Java-style `hash * 31 + byte` Smi hash in the same pass, look up `Map<hash, Entry | Entry[]>` keyed by byte content. Single-entry buckets (the vast majority -- 4 787 unique names into 2^32 hash space gives essentially zero collisions) store the Entry directly; collision buckets get promoted to an Entry[] for linear scan. Entry carries a small Uint8Array copy of the name body for exact equality check.

Cold path: byte-cache miss. Build the lookup string in one shot (String.fromCharCode is fast for short ranges via direct args, no .apply), call the upstream `PDFName.of` -- which on this stack means fast-decode-name's string-keyed Map -- and cache the returned PDFName in the byte-cache for next time. Both caches converge on the same PDFName per logical name; direct PDFName.of(...) calls from non-parser code (setOutline, setMetadata) bypass the byte-cache and go straight through fast-decode-name -- correct, no byte range available.

Measured on the book (paired heap + cpu profile, baseline = prior --fast-refs-class + --fast-dict-onebuf + --fast-array-onebuf state, this on top):

    Process wall-clock                0.90 s -> 0.82 s      (-80 ms, -9 %)
      load                            0.41 s -> 0.33 s
      save                            0.42 s -> 0.42 s
    parseName + fastOf (combined)     144 ms -> 58 ms       (-86 ms)
      PDFObjectParser.parseName       (gone from top 15)
      fastOf (PDFName decode-name)    52 ms -> (gone from top 15)
    Heap (sampled total)              33.68 MB -> 34.98 MB  (+1.30 MB)
      new fastParseName row           n/a -> 1 269 KB       (the cache)
      set (builtin)                   624 KB -> 852 KB      (+228 KB)

Heap-profiled process wall-clock dropped much more (3.50 s -> 2.56 s, -940 ms) than the cpu-profiled run did -- because the sampler's per-allocation bookkeeping is the dominant cost under 512 B sampling, and we just eliminated ~1.6 M transient string allocations that were all under the sample threshold (so they don't show up in the heap row, only in the sampler's wall-clock overhead). Read the cpu number for "did we get faster"; read the heap-row delta for "what's the long-lived cost" (+1.3 MB, the cache itself).

GC self-time +21 ms on the cpu run (the live byte-cache adds to mark cost), more than offset by the -80 ms parseName savings.

Output PDF byte-identical: the byte-cache and the string-cache return the same PDFName instance per logical name, so all downstream code sees the same identity.
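The hot-path / cold-path split can be illustrated with a small interning cache. This is a sketch under assumptions, not the shim itself: `internBytes`, `internByString` and the `{ bytes, value }` entry layout are hypothetical, the hash is wrapped to a 32-bit integer with `| 0`, the cold path builds its string in a simple loop, and PDF name escapes (`#xx`) are ignored.

```javascript
const byString = new Map(); // stand-in for the string-keyed cache behind PDFName.of
const byHash = new Map();   // hash -> Entry | Entry[]

function internByString(str) {
  let value = byString.get(str);
  if (value === undefined) {
    value = { name: str };
    byString.set(str, value);
  }
  return value;
}

// Exact byte comparison guards against hash collisions.
function bytesEqual(entry, buf, start, end) {
  const b = entry.bytes;
  if (b.length !== end - start) return false;
  for (let i = 0; i < b.length; i++) {
    if (b[i] !== buf[start + i]) return false;
  }
  return true;
}

function internBytes(buf, start, end) {
  // Hot path: hash the bytes in one pass, no string is built.
  let hash = 0;
  for (let i = start; i < end; i++) hash = (hash * 31 + buf[i]) | 0;

  const bucket = byHash.get(hash);
  if (bucket !== undefined) {
    if (Array.isArray(bucket)) {
      for (const e of bucket) {
        if (bytesEqual(e, buf, start, end)) return e.value;
      }
    } else if (bytesEqual(bucket, buf, start, end)) {
      return bucket.value;
    }
  }

  // Cold path: build the string once, intern it, remember the bytes.
  let str = '';
  for (let i = start; i < end; i++) str += String.fromCharCode(buf[i]);
  const entry = { bytes: buf.slice(start, end), value: internByString(str) };
  if (bucket === undefined) byHash.set(hash, entry);
  else if (Array.isArray(bucket)) bucket.push(entry);
  else byHash.set(hash, [bucket, entry]); // promote to a collision list
  return entry.value;
}

const data = new TextEncoder().encode('/Type /Page /Type');
const first = internBytes(data, 1, 5);   // "Type", cold path
const again = internBytes(data, 13, 17); // "Type", hot path
console.log(first === again, internByString('Type') === first);
```

Because the cold path routes through the string-keyed cache, a lookup by bytes and a lookup by string return the same object, which is the identity property the byte-identical output relies on.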

…e, -10 %).

d layout shifts from start[0:23] / length[24:37] to start[0:22] / gap[23:24] / length[25:40], freeing bits 23 and 24 for PDFPageLeaf's normalized + autoNormalizeCTM. _FastPageLeaf collapses to a single d field; the booleans become prototype getters/setters that mask in/out of d. start drops 24 -> 23 bits (8.4 M slots, still well above the ~2.3 M mainLen on the book); length grows 14 -> 16 bits (65 535, ample headroom over the 8 706 observed max). Heap saving on the 1 651 page leaves is sub-row at the 512 B sampler resolution but real (~26 KB). Output byte-identical to baseline. CPU flat (no PDFPageLeaf mutation paths fire on the book).
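A sketch of the packed layout, assuming `d` is an ordinary JavaScript number. Everything here is illustrative (`pack`, `getFlag`, `setFlag`, `PageLeafSketch` are hypothetical names, and the real shim may encode the fields differently). Because the layout spans 41 bits, the sketch uses arithmetic rather than 32-bit bitwise operators when writing.

```javascript
const START_SPAN = 2 ** 23;      // start occupies bits 0..22
const FLAG_NORMALIZED = 2 ** 23; // bit 23
const FLAG_AUTO_CTM = 2 ** 24;   // bit 24
const LENGTH_UNIT = 2 ** 25;     // length occupies bits 25..40

const pack = (start, length) => length * LENGTH_UNIT + start;
const startOf = (d) => d % START_SPAN;
const lengthOf = (d) => Math.floor(d / LENGTH_UNIT);
const getFlag = (d, flag) => Math.floor(d / flag) % 2 === 1;
const setFlag = (d, flag, on) =>
  getFlag(d, flag) === on ? d : on ? d + flag : d - flag;

// Hypothetical page-leaf wrapper: one data field, booleans as accessors.
class PageLeafSketch {
  constructor(start, length) {
    // Defaults mirror the old fields: normalized = false, autoNormalizeCTM = true.
    this.d = pack(start, length) + FLAG_AUTO_CTM;
  }
  get normalized() {
    return getFlag(this.d, FLAG_NORMALIZED);
  }
  set normalized(v) {
    this.d = setFlag(this.d, FLAG_NORMALIZED, Boolean(v));
  }
  get autoNormalizeCTM() {
    return getFlag(this.d, FLAG_AUTO_CTM);
  }
  set autoNormalizeCTM(v) {
    this.d = setFlag(this.d, FLAG_AUTO_CTM, Boolean(v));
  }
}

const leaf = new PageLeafSketch(2300000, 8706);
leaf.normalized = true;
console.log(startOf(leaf.d), lengthOf(leaf.d), leaf.normalized, leaf.autoNormalizeCTM);
```

Flipping either flag leaves `start` and `length` untouched, since each field owns a disjoint bit range.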

Split _FastRef into two constructors: _FastRef(objectNumber) for the gen=0 path (every PDFRef on fresh-Chrome workloads) carrying a single inline data field, and _FastRefGen(objectNumber, generationNumber) for the rare gen!=0 path (the xref free entry at object 0). A default `generationNumber = 0` on PDFRef.prototype supplies the missing field for _FastRef instances via prototype lookup, so reads of either property stay as plain data-property loads -- no accessor-property boundary that would deopt the IC at upstream call sites (PDFCrossRefSection.append, PDFCrossRefStream entry tuples, PDFWriter.serializeToBuffer, fast-indirect-objects, ...).

Per-instance: 24 B (two slots) -> 16 B (one slot). On 226 k unique PDFRefs, ~1.88 MB heap saved (paired heap profile: 34.96 MB -> 33.08 MB total sampled). CPU min flat (no-profile wall-clock 0.70 s vs ~0.83 s pre, but the heap-saving lane isn't the source of CPU movement). PDF output byte-identical.

An accessor-property variant (single packed d + getters for objectNumber/generationNumber) was tried first and rejected: it regressed heap by +1.6 MB and CPU by +70 ms because the getter dispatch broke V8's monomorphic ICs at the upstream xref-write sites, forcing recompilation paths that couldn't elide the {ref, offset, deleted} object literals in PDFCrossRefSection.addEntry as aggressively.
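The prototype-default trick can be shown in a few lines. `RefBase`, `FastRef`, `FastRefGen` and `makeRef` are hypothetical names for illustration; this sketch shows only the property-lookup behaviour, not the shim's pooling or its measured memory effect.

```javascript
// Hypothetical stand-in for the upstream PDFRef class.
class RefBase {}
// Default lives on the prototype as a plain data property.
RefBase.prototype.generationNumber = 0;

// Common case: one own field; generationNumber is read from the prototype.
function FastRef(objectNumber) {
  this.objectNumber = objectNumber;
}
FastRef.prototype = RefBase.prototype;

// Rare case: both fields are own properties.
function FastRefGen(objectNumber, generationNumber) {
  this.objectNumber = objectNumber;
  this.generationNumber = generationNumber;
}
FastRefGen.prototype = RefBase.prototype;

function makeRef(objectNumber, generationNumber = 0) {
  return generationNumber === 0
    ? new FastRef(objectNumber)
    : new FastRefGen(objectNumber, generationNumber);
}

const common = makeRef(12);
const rare = makeRef(0, 65535);
console.log(common.generationNumber, Object.keys(common));
console.log(rare.generationNumber, Object.keys(rare));
```

Callers read `ref.generationNumber` the same way in both cases; only the rare variant pays for a second own field.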

Companion to analyze-heap.mjs / find-heap-callers.mjs. Prints every match for a substring on functionName, with the matched frame's self-size and each direct child's self + descendant total. Built during the PDFRef class-shape round to investigate why `maybeParseCrossRefSection` showed 3.4 MB self with <40 KB worth of named children -- a V8 inlining-attribution case where the parent frame's compiled code absorbed `PDFCrossRefSection.addEntry`'s object-literal allocations. The flat top-15 view can't tell you that; this script can.
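A rough sketch of that kind of lookup over a sampling heap profile tree, where each node carries `callFrame.functionName`, `selfSize` and `children`. This is not the script from the commit; `findFrames` and `totalSize` are hypothetical helpers, and the profile below is toy data with made-up names and sizes.

```javascript
// Toy profile; all function names and sizes are invented for illustration.
const profile = {
  head: {
    callFrame: { functionName: '(root)' },
    selfSize: 0,
    children: [
      {
        callFrame: { functionName: 'parseSection' },
        selfSize: 3000,
        children: [
          { callFrame: { functionName: 'addEntry' }, selfSize: 10, children: [] },
          {
            callFrame: { functionName: 'readNumber' },
            selfSize: 20,
            children: [
              { callFrame: { functionName: 'toDigits' }, selfSize: 5, children: [] },
            ],
          },
        ],
      },
    ],
  },
};

// Self size of a node plus everything allocated beneath it.
function totalSize(node) {
  let sum = node.selfSize;
  for (const child of node.children) sum += totalSize(child);
  return sum;
}

// Collect every frame whose function name contains the needle,
// along with each direct child's self and descendant-inclusive size.
function findFrames(node, needle, out = []) {
  if (node.callFrame.functionName.includes(needle)) {
    out.push({
      functionName: node.callFrame.functionName,
      selfSize: node.selfSize,
      children: node.children.map((c) => ({
        functionName: c.callFrame.functionName,
        selfSize: c.selfSize,
        totalSize: totalSize(c),
      })),
    });
  }
  for (const child of node.children) findFrames(child, needle, out);
  return out;
}

const matches = findFrames(profile.head, 'parseSec');
for (const m of matches) {
  console.log(`${m.functionName}: self ${m.selfSize} B`);
  for (const c of m.children) {
    console.log(`  ${c.functionName}: self ${c.selfSize} B, total ${c.totalSize} B`);
  }
}
```

A parent with a large self size and small children, as in the toy data, is the pattern that suggests allocations from an inlined callee were attributed to the parent frame.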

KubaO added 30 commits May 23, 2026 23:48

Use node's built-in zlib instead of pako, saves >1s from the PDF build.

470917c

Speed up pdf loading by ~0.3s.

c423231

Parallelize deflate when saving the pdf.

272413d

Use zlib in larger chunks for deflation, also use it for inflation.

4ecf184

Speed up pdf-lib's number parsing.

20b17d6

Improve performance of a few helper functions.

79aa2a1

Speed up size-in-bytes, add previous patches to measure.mjs.

cbd1534

Improve iterator performance.

4515a8f

Synchronify pdf-lib's load + save paths, pin pdf-lib + puppeteer.

3cf4743

Dispatch parseObject by first byte; gate true/false/null matchKeyword…

d928c2d

… scans.

Add sampling-heap-profile instrumentation for the process phase.

3bde613

Add find-heap-callers.mjs: attribute heap allocations to direct callers.

8939530

Update the README.

13c3adb

KubaO added 14 commits May 24, 2026 01:57

pipeline-deflate: overlap buffer-build with libuv deflate (-47 ms sav…

02aa7fb

…e, -10 %).

KubaO merged commit e3f4b40 into twinbasic:main May 24, 2026
1 check passed

KubaO merged 44 commits into twinbasic:main from KubaO:staging
